assembly and viral quasispecies reconstruction
Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction
Understanding genetic variation, e.g., through mutations, in organisms is crucial to unravel their effects on the environment and human health. A fundamental characterization can be obtained by solving the haplotype assembly problem, which yields the variation across multiple copies of chromosomes. Variations among fast-evolving viruses that lead to different strains (called quasispecies) are also deciphered with similar approaches. In both cases, high-throughput sequencing technologies that provide oversampled mixtures of large noisy fragments (reads) of genomes are used to infer the constituent components (haplotypes or quasispecies). The problem is harder for polyploid species, which have more than two copies of each chromosome. State-of-the-art neural approaches to this NP-hard problem do not adequately model the relations among reads that are important for deconvolving the input signal. We address this problem by developing a new method, called NeurHap, that combines graph representation learning with combinatorial optimization. Our experiments demonstrate the substantially better performance of NeurHap on real and synthetic datasets compared to competing approaches.
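The title frames read grouping as graph coloring. As a rough illustration only, and not NeurHap itself, the sketch below builds a conflict graph over reads and colors it with a classical greedy heuristic. The read representation (a dict from SNP index to allele), the function names, and the `min_mismatch` threshold are all assumptions made for this example; the abstract does not specify them.

```python
from itertools import combinations


def conflict_graph(reads, min_mismatch=1):
    """Build an adjacency map linking reads that disagree on shared SNPs.

    reads: list of dicts mapping SNP index -> allele (hypothetical format).
    Two reads are linked when they differ at >= min_mismatch shared positions.
    """
    adj = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        shared = reads[i].keys() & reads[j].keys()
        mismatches = sum(reads[i][p] != reads[j][p] for p in shared)
        if mismatches >= min_mismatch:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_coloring(adj):
    """Assign each node the smallest color unused by its colored neighbors.

    Nodes are visited in order of decreasing degree. This is a plain
    heuristic, not the learned method described in the abstract, and it
    may use more colors than the true number of haplotypes.
    """
    colors = {}
    for node in sorted(adj, key=lambda n: -len(adj[n])):
        used = {colors[nb] for nb in adj[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Four toy reads drawn from two haplotypes (000... vs 111... style alleles).
reads = [{0: 0, 1: 0}, {1: 0, 2: 1}, {0: 1, 1: 1}, {1: 1, 2: 0}]
colors = greedy_coloring(conflict_graph(reads))
print(colors)
```

Reads that share a color never conflict with one another, so each color class is a candidate group of reads from the same haplotype or strain.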
A Convolutional Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction
Haplotype assembly and viral quasispecies reconstruction are challenging tasks concerned with the analysis of genomic mixtures using sequencing data. High-throughput sequencing technologies generate enormous numbers of short fragments (reads) which essentially oversample the components of a mixture; this redundancy enables reconstruction of the components (haplotypes, viral strains). The reconstruction problem, known to be NP-hard, boils down to grouping together reads that originate from the same component of a mixture. Existing methods struggle to solve this problem with the required level of accuracy and low runtimes, and the problem becomes increasingly challenging as the number and length of the components increase. This paper proposes a read clustering method based on a convolutional auto-encoder designed to first project sequenced fragments to a low-dimensional space and then estimate the probability of the read origin using learned embedded features. The components are reconstructed by finding consensus sequences that agglomerate reads from the same origin. Mini-batch stochastic gradient descent and dimension reduction of reads allow the proposed method to deal efficiently with massive numbers of long reads. Experiments on simulated, semi-experimental and experimental data demonstrate the ability of the proposed method to accurately reconstruct haplotypes and viral quasispecies, often with superior performance compared to state-of-the-art methods. Source code is available at https://github.com/WuLoli/CAECseq.
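The final step described above, building a consensus sequence from reads assigned to the same origin, can be sketched with a per-position majority vote. This is a minimal illustration under stated assumptions, not the CAECseq code: the read format (a dict from position to base), the function name `consensus_haplotypes`, and the `gap` placeholder are hypothetical, and the cluster labels are taken as given rather than produced by an auto-encoder.

```python
from collections import Counter


def consensus_haplotypes(reads, labels, n_positions, gap="-"):
    """Majority-vote consensus per cluster.

    reads: list of dicts mapping position -> base (hypothetical format).
    labels: cluster label for each read, assumed to come from some
            upstream clustering step.
    Positions not covered by any read in a cluster are filled with `gap`.
    """
    clusters = {}
    for read, label in zip(reads, labels):
        clusters.setdefault(label, []).append(read)

    haplotypes = {}
    for label, members in clusters.items():
        seq = []
        for pos in range(n_positions):
            votes = Counter(r[pos] for r in members if pos in r)
            seq.append(votes.most_common(1)[0][0] if votes else gap)
        haplotypes[label] = "".join(seq)
    return haplotypes


reads = [
    {0: "A", 1: "C"},
    {1: "C", 2: "G"},
    {0: "A", 2: "G"},
    {0: "T", 1: "G"},
    {1: "G", 2: "A"},
]
labels = [0, 0, 0, 1, 1]
print(consensus_haplotypes(reads, labels, n_positions=3))
# {0: 'ACG', 1: 'TGA'}
```

Because each position is voted on independently, a few noisy reads in a cluster are outvoted as long as most reads covering that position carry the correct base.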
Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction
The pseudocode for NeurHap-refine is as follows: Algorithm 1: The Local Refinement Algorithm NeurHap-refine. Two categories of datasets are used in the paper: Polyploid species and Viral Quasispecies. BWA-MEM [Li, 2013] is used to align reads to the reference genome. The detailed command is (taking the 15-strain ZIKV as an example): $ ./bwa [...] Vikalo, 2020a,b] to derive the SNP matrix from the above alignment to ensure a fair comparison.
Review for NeurIPS paper: A Convolutional Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction
Weaknesses: As noted below, (1) it is not clear what the contribution of their method is, and (2) necessary details are missing from their evaluations. Also, at some points in preparing the datasets, the authors restrict certain parameters without explaining why they chose those values. This is concerning, so I think they need to justify those parameters. For example, for TB they mention that: "Read alignment is performed using BWA-MEM (Li and Durbin, 2009), where the reads with mapping scores lower than 40 are filtered out for quality control." But for HIV they choose a different approach: "Reads with mapping score lower than 60 and length shorter than 150 bp are filtered out for quality control".
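To make the two quoted filtering rules concrete, here is a small sketch that applies a mapping-quality and length cutoff to SAM alignment lines. It is not taken from the paper under review; the function name `filter_sam_lines` and its parameters are hypothetical. It relies only on the standard SAM column layout, in which MAPQ is the fifth tab-separated field and the read sequence is the tenth.

```python
def filter_sam_lines(lines, min_mapq, min_length=0):
    """Keep header lines and alignments passing the quality cutoffs.

    lines: iterable of SAM text lines.
    min_mapq: reads with MAPQ below this value are dropped.
    min_length: reads shorter than this many bases are dropped.
    """
    kept = []
    for line in lines:
        if line.startswith("@"):  # SAM header lines are passed through
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM requires 11 mandatory fields
            continue
        mapq = int(fields[4])
        seq = fields[9]
        # SEQ may be "*" when the sequence is not stored; treat as length 0.
        length = 0 if seq == "*" else len(seq)
        if mapq >= min_mapq and length >= min_length:
            kept.append(line)
    return kept


def make_read(name, mapq, length):
    """Build a toy SAM record (illustrative helper, not a real aligner output)."""
    return "\t".join(
        [name, "0", "ref", "1", str(mapq), f"{length}M", "*", "0", "0",
         "A" * length, "I" * length]
    )


sam = ["@HD\tVN:1.6", make_read("r1", 60, 150), make_read("r2", 45, 150),
       make_read("r3", 60, 100)]

tb_style = filter_sam_lines(sam, min_mapq=40)                   # MAPQ >= 40
hiv_style = filter_sam_lines(sam, min_mapq=60, min_length=150)  # MAPQ >= 60, >= 150 bp
print(len(tb_style), len(hiv_style))
```

Exposing both cutoffs as parameters makes the difference between the two datasets explicit, which is the kind of choice the review asks the authors to justify.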